Abstract: Biaxial motion stages are commonly used in precision motion applications. However, the contour following 
accuracy of a biaxial motion stage often suffers from system nonlinearities and external disturbances. To deal with the 
above-mentioned problem, a control scheme consisting of a reinforcement Q-learning controller with a self-tuning learning 
rate and two iterative learning controllers is proposed in this paper. In particular, the reinforcement Q-learning controller is 
used to compensate for friction and also cope with the problem of dynamics mismatch between different axes. In addition, 
one of the two iterative learning controllers is used to suppress periodic external disturbances, while the other one is 
employed to adjust the learning rate of the reinforcement Q-learning controller. Results of contour following experiments 
indicate that the proposed approach is feasible. 
I. Introduction 

Contour following is commonly seen in industrial processes such as machining, cutting, polishing, deburring, painting and 
welding. In these industrial processes, product quality depends on contour following accuracy. Generally speaking, better 
contour following accuracy can be achieved by reducing tracking errors and/or contour error [14]. As a matter of fact, 
tracking error reduction is one of the most important research topics in the contour following problems of multi-axis motion 
stages [1]-[4]. Due to factors such as external disturbance, system nonlinearity, servo lag and mismatch in axis dynamics, 
the contour following accuracy of a multi-axis motion stage may not be able to meet the accuracy requirements [5]-[7]. 

There are many existing approaches that can be used in practice to reduce the tracking error of a multi-axis motion stage 
[9]-[12]. For example, the commonly used multi-loop feedback control scheme with command feedforward is very effective in 
reducing tracking error caused by the servo lag phenomenon [8]. In addition, advanced control schemes such as sliding mode 
control and adaptive control can be used to reduce tracking error as well. Recently, the number of studies exploiting the 
paradigm of artificial neural networks to improve the contour following accuracy of multi-axis motion stages has risen 
steadily [13]-[22]. For instance, Wen and Cheng [13] proposed a fuzzy CMAC with a critic-based learning mechanism to cope with 
external disturbance and nonlinearity so as to reduce tracking error. Later on, Wen and Cheng [15] further proposed a 
recurrent fuzzy cerebellar model articulation controller with a self-tuning learning rate to improve contour following 
accuracy for a piezoelectric actuated dual-axis micro motion stage. In addition to tracking error reduction, the paradigm of 
artificial neural network has been applied to different fields such as wind power generation [24], the game of Go [22], and 
object grasping using robots [25]. Generally, a neural network needs to be trained before it can be used to solve a particular 
problem. Among different training mechanisms for neural networks, reinforcement learning is the one that has received a lot 
of attention recently [21]. In this paper, a control scheme consisting of a reinforcement Q-learning controller with an 
adjustable learning rate and two iterative learning controllers (ILC) is proposed to improve contour following accuracy of a 
bi-axial motion stage. In the proposed approach, the reinforcement Q-learning controller is responsible for friction 
compensation and also deals with the dynamics mismatch between different axes. In addition, one of the two ILCs is 
exploited to deal with the adverse effects due to periodic external disturbances from repetitive motions, while the other ILC 
is exploited to tune the learning rate of Q-learning based on current tracking error so as to further improve contour following 
accuracy. 

The remainder of the paper is organized as follows. Section 2 gives a brief review on reinforcement learning and iterative 
learning control. Section 3 introduces the proposed control scheme. Experimental results and conclusions are provided in 
Sections 4 and 5, respectively. 
II. Brief Review on Reinforcement Learning and Iterative Learning Control 


Since the proposed control scheme exploits the idea of reinforcement learning and iterative learning control, brief reviews on 
these two research topics will be provided in this section. 

2.1 Reinforcement Learning and Q-Learning 

In general, learning mechanisms of artificial neural networks can be divided into three types: supervised learning, 
unsupervised learning and reinforcement learning. Unlike the other two types of learning which either need training pairs or 
expected final outcome, reinforcement learning “learns” proper actions by maximizing the reward simply based on the 
reward/penalty resulting from previous action and current environment. Fig. 1 illustrates a typical control block diagram that 
employs reinforcement Q-learning. 

In general, the reinforcement Q-learning controller can be expressed as: 

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t \left[ r_t + \gamma \max_{a_i} Q_{t+1}(s_{t+1}, a_i) - Q_t(s_t, a_t) \right] \qquad (1)$$

where $Q_t(s_t, a_t)$ is the Q value corresponding to the state $s_t$ and action $a_t$ in the Q-table; $s$: state; $a$: action; 
$i$: action index in the action space; $\alpha$: learning rate; $r$: reward; $\gamma$: discount factor; 
$\max_{a_i} Q_{t+1}(s_{t+1}, a_i)$ is the maximum value of Q corresponding to state $s_{t+1}$ and action $a_i$; $t$: time variable. 
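To make the update in Eq. (1) concrete, the following is a minimal tabular sketch. The function name `q_update`, the table size, and the numerical values are hypothetical and chosen only for illustration; the maximum over actions is evaluated on the current table, as is usual in tabular implementations.

```python
import numpy as np

def q_update(q_table, s, a, r, s_next, alpha, gamma):
    """Apply one tabular Q-learning update in the spirit of Eq. (1)."""
    # Target: immediate reward plus discounted best Q value of the next state.
    td_target = r + gamma * np.max(q_table[s_next])
    # Move the stored value toward the target by the learning rate alpha.
    q_table[s, a] += alpha * (td_target - q_table[s, a])
    return q_table

# Hypothetical table with 3 states and 2 actions, initialised to zero.
q = np.zeros((3, 2))
q = q_update(q, s=0, a=1, r=1.0, s_next=2, alpha=0.5, gamma=0.9)
print(q[0, 1])  # 0.5
```

With a zero-initialised table the target equals the reward, so the entry moves half of the way toward 1.0 when the learning rate is 0.5.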



Fig. 1 A typical control block diagram that employs reinforcement-based Q-learning 

The probability of selecting action $a_i$ in Q-learning is described by (2). 

$$P(s_t, a_i) = \frac{e^{Q_t(s_t, a_i)}}{\sum_{j} e^{Q_t(s_t, a_j)}} \qquad (2)$$

2.2 Iterative Learning Controller 


As reported in many previous studies, ILC is effective in suppressing periodic disturbances caused by repetitive motions 
[17]-[19]. Fig. 2 is the block diagram for a control scheme consisting of a feedback controller and a control law based ILC [16]. 
In Fig. 2, $U_{ilc}$ is the control force generated by ILC, $L$ is the learning function and $F$ is a low-pass filter. The tracking 
error $e_j$ and the control force $U_{ilc}$ generated by ILC in the previous iteration are stored and used to update $U_{ilc}$ in the 
current iteration. The total control force $U_j$ applied to the plant $G$ in the $j$th iteration is the sum of $U_{ilc}$ and the 
feedback controller output $U_{fb}$. 



Fig. 2 Block diagram for a control scheme consisting of a feedback controller and a control law based ILC 
Based on Fig. 2, the relationship between the output $y_j$ in the $j$th iteration and the input $P_{cmd}$ can be described as: 

$$y_j = (1 + GC)^{-1} G \cdot U_j + (1 + GC)^{-1} GC \cdot P_{cmd} \qquad (3)$$

where the total control force $U_{j+1}$ in the $(j+1)$th iteration is updated using Eq. (4): 

$$U_{j+1} = F(U_j + L e_j) + C e_{j+1} \qquad (4)$$

where $e_j$ and $U_j$ are the tracking error and total control force in the $j$th iteration, respectively. Note that in Eq. (4), 
$C e_{j+1}$ can be regarded as the feedback control force. The control force $U_{ilc}$ generated by ILC aims at reducing the tracking 
error. Namely, better performance of ILC leads to smaller tracking error so that the feedback control force decreases as well. 
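The iteration-domain update can be illustrated with a small simulation. This is only a sketch under stated assumptions: the first-order plant, the proportional feedback gain, the sinusoidal command and disturbance, the 3-tap smoothing filter standing in for $F$, and the learning gain (applied with a one-sample advance) are all hypothetical and are not taken from the experimental setup.

```python
import numpy as np

def lowpass(x):
    """Zero-phase 3-tap smoothing filter standing in for F (assumption)."""
    padded = np.pad(x, 1, mode="edge")
    return np.convolve(padded, [0.25, 0.5, 0.25], mode="valid")

def run_trial(u_ilc, ref, dist, kp=2.0, a=0.9, b=0.1):
    """Simulate one pass of a hypothetical first-order plant under P feedback."""
    n = len(u_ilc)
    y = np.zeros(n + 1)
    for k in range(n):
        u_fb = kp * (ref[k] - y[k])                  # feedback control force
        y[k + 1] = a * y[k] + b * (u_ilc[k] + u_fb + dist[k])
    return ref - y                                   # tracking error, length n + 1

def ilc_experiment(n_iter=7, n=200, learn_gain=2.0):
    """Repeat the same motion n_iter times and return the RMS error of each pass."""
    k = np.arange(n + 1)
    ref = np.sin(2 * np.pi * k / n)                  # repetitive position command
    dist = 0.5 * np.sin(2 * np.pi * 5 * k / n)       # periodic disturbance
    u_ilc = np.zeros(n)
    rms = []
    for _ in range(n_iter):
        e = run_trial(u_ilc, ref, dist)
        rms.append(float(np.sqrt(np.mean(e ** 2))))
        # ILC part of the update: filter(previous force + learning gain * error).
        # e[1:] advances the error by one sample to match the plant delay.
        u_ilc = lowpass(u_ilc + learn_gain * e[1:])
    return rms

print(ilc_experiment())
```

Because the command and the disturbance repeat identically in every pass, the stored correction accumulates over iterations and the RMS error shrinks, while the feedback term handles whatever error remains within the current pass.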


III. The Proposed Reinforcement Q-Learning Controller with an Adjustable Learning Rate 

Fig. 3 illustrates the block diagram for the control scheme consisting of a feedback controller, a control law based ILC, and 
the proposed reinforcement Q-learning controller with an adjustable rate. 
Fig. 3 Block diagram for the control scheme consisting of a feedback controller, a control law based ILC, and the proposed 
reinforcement Q-learning controller with an adjustable rate. 

In the proposed approach, the reinforcement Q-learning controller is modified as: 

$$Q_{t+1}(e_t, a_t) = Q_t(e_t, a_t) + L_{ilc} \left[ r_t + \gamma \max_{a_i} Q_{t+1}(e_{t+1}, a_i) - Q_t(e_t, a_t) \right] \qquad (5)$$

Compared with Eq. (1), the learning rate $\alpha_t$ in the conventional Q-learning algorithm is replaced by $L_{ilc}$ in Eq. (5), 
where $L_{ilc}$ is updated using Eq. (6): 

$$L_{ilc_{j+1}} = F\left[ L_{ilc_j} + L e_j \right] \qquad (6)$$

where $L_{ilc_{j+1}}$ is the learning rate for the reinforcement Q-learning controller in the $(j+1)$th iteration. Note that in this 
paper, the value of $L_{ilc}$ is constrained to be between zero and one. Moreover, the aim in this paper is to reduce the tracking 
error of a multi-axis motion stage. As a result, the state $s$ in Eq. (1) is replaced by the tracking error $e$ in Eq. (5). In 
addition, three possible actions (accelerate, decelerate, and maintain constant velocity) can be selected for $a_t$ in Eq. (5) to 
adjust the velocity command for the motion stage. The probability of selecting action $a_i$ in the Q-learning algorithm is rewritten as: 

$$P(e_t, a_i) = \frac{e^{Q_t(e_t, a_i)}}{\sum_{j} e^{Q_t(e_t, a_j)}} \qquad (7)$$

where the state is the tracking error $e$. 
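The iteration-wise adjustment of the learning rate can be sketched as follows. The sketch assumes that the learning rate is stored as a profile over the trajectory, that it is updated from the previous iteration's tracking error in the same way as an ILC control force, and that the constraint to the range from zero to one is enforced by clipping. The function name, the gain, and the sample values are hypothetical, and the signed error is used as written; whether the error enters by its sign or its magnitude is not stated in the text.

```python
import numpy as np

def update_learning_rate(rate_prev, err_prev, gain, smooth=None):
    """Return the next-iteration learning-rate profile, kept within [0, 1]."""
    raw = rate_prev + gain * err_prev      # ILC-style correction from the last pass
    if smooth is not None:
        raw = smooth(raw)                  # optional low-pass filter F (assumption)
    return np.clip(raw, 0.0, 1.0)          # constrain the rate to [0, 1]

# Hypothetical three-sample profile and tracking error.
rate_prev = np.array([0.5, 0.5, 0.95])
err_prev = np.array([0.1, -0.2, 0.3])
print(update_learning_rate(rate_prev, err_prev, gain=1.0))  # [0.6 0.3 1. ]
```

When the tracking error becomes small, the correction term vanishes and the profile stops changing from one iteration to the next.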

In this paper, the action space $A$ consists of three actions: 

$$A = \{a_1, a_2, a_3\} \qquad (8)$$

where $a_1$: accelerate; $a_2$: decelerate; $a_3$: maintain constant speed. 
In this paper, the reward is designed to reduce tracking error. In particular, the reward is determined using Eq. (9). 
Fig. 4 illustrates the block diagram of the motion control scheme for a bi-axial motion stage proposed in this paper. In Fig. 4, 
the velocity command $\omega_{cmd\_x}$ for the x-axis consists of the control force $U_{ilc\_x}$ generated by ILC, the feedback 
control force $U_{fb\_x}$, and the control force $U_{RL\_x}$ generated by the reinforcement Q-learning controller. It can be expressed as: 

$$\omega_{cmd\_x} = U_{fb\_x} + U_{ilc\_x} + U_{RL\_x} \qquad (10)$$

where 

$$U_{ilc\_x} = F(U_j + L \cdot e_j)$$

$$U_{RL\_x} = v + \Delta v$$


( 11 ) 

( 12 ) 

(13) 


Note that the velocity command for the y-axis is designed similarly. 

In Fig. 4, the reinforcement Q-learning controller with adjustable learning rate is responsible for friction compensation and 
also deals with the dynamics mismatch between the x-axis and y-axis. Fig. 4 also shows that two ILCs are employed in the 
proposed motion control scheme. In particular, one ILC is exploited to adjust the learning rate of the reinforcement Q- 
learning controller, while the other ILC is exploited to deal with the adverse effects due to periodic external disturbances so 
as to further reduce tracking error. 



Fig. 4 The block diagram of the proposed motion control scheme for a bi-axial motion stage. 

IV. Experimental Results 

Fig. 5 shows a photograph of the bi-axial motion stage used to assess the effectiveness of the proposed approach. Fig. 6(a) 
shows the circle-shaped contour represented in NURBS form used in the contour following experiment. Under S-curve 
acceleration/deceleration motion planning, a NURBS interpolator [23] is employed to convert the circle-shaped contour into 
the position commands for the x-axis (Fig. 6(b)) and y-axis (Fig. 6(c)). Each pass around the circle takes 9.5 seconds. 
In each experiment, circle following is performed seven times (i.e., seven iterations). In total, four different 
control schemes are tested in the contour following experiments. They are: 

Control scheme #1: PI type feedback controller combined with an ILC. 

Control scheme #2: PI type feedback controller combined with a reinforcement Q-learning controller with a fixed learning 
rate. 

Control scheme #3: PI type feedback controller combined with an ILC and a reinforcement Q-learning controller with a fixed 
learning rate. 
Control scheme #4: PI type feedback controller combined with an ILC and a reinforcement Q-learning controller with an 
adjustable learning rate. 

Due to the limitations in the paper length, only the experimental results of the tracking error in the x-axis for these four tested 
control schemes are shown in Fig. 7. In addition, performance indices in terms of root mean square of tracking error (RMS), 
average of integral of absolute tracking error (AIAE), and maximum tracking error (MAX) are listed in Table 1. 
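For reference, the three indices can be computed from a sampled tracking-error record as sketched below. The function name is hypothetical, and AIAE is interpreted here as the mean of the absolute error over the samples, which is an assumption about how the integral is averaged.

```python
import numpy as np

def tracking_indices(err):
    """Compute RMS, AIAE and MAX from a sampled tracking-error sequence."""
    e = np.asarray(err, dtype=float)
    return {
        "RMS": float(np.sqrt(np.mean(e ** 2))),   # root mean square error
        "AIAE": float(np.mean(np.abs(e))),        # average absolute error (assumed form)
        "MAX": float(np.max(np.abs(e))),          # largest absolute error
    }

# Hypothetical four-sample error record.
print(tracking_indices([3.0, -4.0, 0.0, 0.0]))
# {'RMS': 2.5, 'AIAE': 1.75, 'MAX': 4.0}
```

RMS penalizes large excursions more heavily than AIAE, so comparing the two indicates whether the error is spread evenly over the contour or concentrated in a few peaks.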



Fig. 5 Photograph of the bi-axial motion stage used in this paper. 

Fig. 6 (a) Circle-shaped contour represented in NURBS form, (b) position commands for x-axis, (c) position commands for y-axis. 
Fig. 7 Experimental results of tracking error in the x-axis (a) control scheme #1 (b) control 

Fig. 8 Variation of the learning rate of the reinforcement Q-learning controller with respect to time. 

Table 1 

Tracking error comparison among four tested control schemes 

Tracking error of x-axis    RMS (μm)    AIAE (μm)    Max (μm)
Scheme #1                   19.22       17.01        35.79
Scheme #2                    9.48        5.16        35.08
Scheme #3                    4.28        2.33        17.63
Scheme #4                    2.62        1.51         9.93


Fig. 7(a) shows the tracking error of the circle following experiment using control scheme #1. Since the ILC is activated only 
after the 1st iteration, the tracking error for the first 9.5 seconds in Fig. 7(a) can be regarded as the result of using the PI 
feedback control alone. Clearly, the tracking error gradually decreases after the 2nd iteration, indicating that ILC is indeed 
effective in suppressing periodic external disturbance. Based on the experimental results shown in Fig. 7(b)-(d), the tracking 
errors for control schemes #2, #3 and #4 all converge much faster than that for control scheme #1. Since control schemes #2, 
#3 and #4 all include a reinforcement Q-learning controller, these results indicate that the reinforcement Q-learning controller 
is indeed effective in reducing tracking error. In particular, the proposed control scheme (i.e., control scheme #4) has the best 
performance among the four tested control schemes, with the smallest RMS, AIAE and maximum tracking errors in Table 1. 

Fig. 8 shows how the learning rate of the reinforcement Q-learning controller in the proposed control scheme varies 
with respect to time. After three iterations (i.e., after 28.5 seconds), the learning rate changes only slightly since the 
tracking error has become very small. 


V. Conclusion 

This paper has proposed a motion control scheme consisting of two ILCs and one reinforcement Q-learning controller for 
contour following accuracy improvement. In particular, one ILC is used to tune the learning rate of the reinforcement 
Q-learning controller, which is mainly used to cope with system nonlinearities, while the other ILC is exploited to suppress 
periodic disturbance during repetitive contour following motions. Results of contour following experiments conducted on a 
bi-axial motion stage indicate that the proposed control scheme is feasible and outperforms the other three control schemes 
tested in the experiments. 